plotly
This lab focuses on creating interactive graphics using
plotly, an open-source graphing tool that can interface
with R and ggplot.
Directions (Please read before starting)
This lab will primarily use the plotly package, but will
also require the ggplot2 package.
# load the following packages
# install.packages("plotly")
library(plotly)
library(ggplot2)
The lab’s examples will use the college scorecard data that we’ve previously been working with:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Before learning anything about plotly, you should be
aware that it is possible to convert a ggplot object into a
plotly graphic:
## Store a simple ggplot scatter plot
my_ggplot <- ggplot(data=colleges, aes(x=Cost, y=Salary10yr_median, color = Private)) + geom_point()
ggplotly(my_ggplot) ## Convert
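The plotly version of this graph adds interactive features, including
tool-tips that appear when you hover over a point, zooming and panning,
and the ability to hide or show a group by clicking its legend entry.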
plot_ly()
The code below demonstrates how to use plotly to create
a scatter plot that is colored by a categorical variable:
plot_ly(data = colleges, type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private)
- type = "scatter" tells plotly to draw a scatter plot
- mode = "markers" plots the data as hover-able dots (rather than text labels or other symbols)
You should notice that plotly uses a ~ character
to identify variables from the data provided in the data
argument. If it were omitted, plotly would look for a
vector called “Cost” in your global R environment.
Additionally, you should notice that this code does not perfectly
recreate the graph we made using ggplot and
ggplotly.
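As a small illustration of the ~ notation described above, the sketch below builds the same scatter plot twice: once using formulas that refer to columns of a data frame, and once by passing vectors directly. The data frame toy and its values are made up for this example and are not part of the colleges data.

```r
library(plotly)

# Hypothetical toy data (invented values, not from the colleges data)
toy <- data.frame(Cost = c(20000, 35000, 50000),
                  Salary = c(38000, 45000, 60000))

# With ~, Cost and Salary are looked up as columns of the data argument
p_formula <- plot_ly(data = toy, type = "scatter", mode = "markers",
                     x = ~Cost, y = ~Salary)

# Without ~, the arguments are evaluated as ordinary R expressions,
# so the vectors themselves must be supplied
p_vector <- plot_ly(type = "scatter", mode = "markers",
                    x = toy$Cost, y = toy$Salary)

p_formula
```

Both objects should display the same three points. The formula version is usually more convenient because the same column names can be reused for color, text, and other arguments.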
plotly and ggplot
The decision to use plotly or ggplot
depends upon the end goal of your visualization, but here are some
factors to consider:
| ggplot | plotly |
|---|---|
| Easier to construct complex graphics | More interactive |
| Easier customization (colors, etc.) | Allows for 3-D graphics |
| More legible syntax and grammar | Allows for animations |
| Annotations and exporting | Can convert ggplot graphics |
plotly
Similar to ggplot, it is possible to build up a
plotly graphic by adding layers via the %>%
operator (akin to the + used with ggplot):
plot_ly(data = colleges) %>%
  add_trace(type = "scatter", mode = "markers", x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
  add_text(x = ~Cost, y = ~Salary10yr_median, text = ~State)
The example above creates a scatter plot using
add_trace(), then it draws a layer of text labels on top of
those markers.
The pipe operator allows plotly to be compatible with
data wrangling functions from the dplyr and
tidyr packages:
colleges %>%
filter(State %in% c("IA", "MN", "IL", "WI")) %>%
plot_ly() %>%
add_trace(type = "box", x = ~Cost, y = ~State)
Typically, the first layer of a plotly graphic is
created using add_trace() and the type
argument. Additional layers are created using other add_
functions (such as add_text()). This prevents the less
important layers from interfering with the hover capacity of the
tool-tip.
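The layering pattern described above can be sketched with another add_ function. The example below is only an illustration: it uses R's built-in mtcars data rather than the colleges data, and the column fitted is a helper added for this sketch. The first layer is a scatter trace created with add_trace(), and a fitted regression line is then drawn on top using add_lines().

```r
library(plotly)

# Fit a simple linear model to the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# Sort by the x-variable and store the fitted values (helper column for this sketch)
cars <- mtcars[order(mtcars$wt), ]
cars$fitted <- predict(fit, newdata = cars)

# First layer: markers; second layer: the fitted line
p_layers <- plot_ly(data = cars) %>%
  add_trace(type = "scatter", mode = "markers",
            x = ~wt, y = ~mpg, name = "Cars") %>%
  add_lines(x = ~wt, y = ~fitted, name = "Linear fit")

p_layers
```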
You can use this reference page for a list of the different types of traces (use the navigation drop-down menus on the left side of the page).
Question #1: Using add_trace(), create
a violin plot that separately displays the distributions of the variable
“Enrollment” for private and public colleges in the “colleges” data set.
Hint: Use the reference page linked above to determine the
proper arguments needed to create this type of graph.
Perhaps the most appealing feature of plotly is the
ability to view a label whenever you hover over a data point or area of
interest.
Information can be added to these labels using the text
argument in either plot_ly() or add_trace().
For example, we can add the names of each college to our previous
scatter plot:
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
Labels are constructed using hypertext markup language (HTML), so their appearance can be modified using HTML tags:
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private,
text = ~paste0(Name, "<br>", City, ", ", State ))
In this example, paste0() is used to combine fixed
character strings with variable values, and the string
“<br>” is the HTML tag used to begin a new
line.
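To see exactly what paste0() passes to the text argument, it can help to build the label strings on their own before plotting. The sketch below uses a made-up two-row data frame (labels_df and its values are hypothetical) and prints the resulting character vector, one label per row.

```r
# Hypothetical data with the same column names used in the example above
labels_df <- data.frame(Name  = c("College A", "College B"),
                        City  = c("Springfield", "Riverton"),
                        State = c("IL", "IA"),
                        stringsAsFactors = FALSE)

# paste0() is vectorized, so one label is produced for each row
hover_text <- paste0(labels_df$Name, "<br>", labels_df$City, ", ", labels_df$State)

print(hover_text)
# [1] "College A<br>Springfield, IL" "College B<br>Riverton, IA"
```

When plotly renders a tool-tip, the “<br>” inside each string is interpreted as a line break rather than shown literally.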
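Some other useful HTML tags include <b> and </b>, which wrap text that should appear in bold, and <i> and </i>, which wrap text that should appear in italics.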
Question #2: Using the “colleges” data, create a scatter plot of the variables “FourYearComp_Males” and “FourYearComp_Females” that includes a custom label which shows each college’s name in bold text, and also shows on a new line its “PercentFemale” after the character string “percentage female:”.
The plotly package is able to create graphics in
3-dimensions. The code below creates a 3-D scatter plot:
plot_ly(data = colleges, type = "scatter3d", mode = "markers",
x = ~Enrollment, y = ~Cost, z = ~ACT_median)
Because 3-D plotly graphs can be rotated, they tend to
be more effective than 3-D scatter plots generated using other
packages.
A second useful type of 3-D graph that plotly can create
is a surface, which is most often used to display a fitted
regression plane.
As an example, consider a linear regression model that predicts the median 10 year salary of graduates based upon a college’s cost and its admissions rate:
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)
Creating a surface to visualize this model involves two preliminary steps:
## Step 1
xs <- seq(0, max(colleges$Cost, na.rm = TRUE), length.out = 100) # Seq from 0 to max cost
ys <- seq(0, 1, length.out = 100) # Equal length seq for adm rate
grid <- expand.grid(Cost = xs, Adm_Rate = ys) # Grid of every combo
## Step 2
z <- predict(model, newdata = grid) # Predictions across the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Store predictions as a matrix
## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>%
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Blues")
This code might seem complicated, but it’s easily adapted to other
models and variables simply by modifying xs,
ys, and model.
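One detail worth checking when adapting this code is the byrow = TRUE argument in Step 2. A surface trace expects the rows of the z matrix to correspond to the y values and the columns to correspond to the x values, while expand.grid() varies its first variable fastest. The toy example below illustrates the alignment using made-up values, with a simple sum standing in for predict(); the grid is deliberately not square so that the orientation is visible.

```r
# Small, non-square grid (invented values for illustration)
xs_toy <- c(1, 2, 3)
ys_toy <- c(10, 20)

# The first variable (x) varies fastest in the expanded grid
grid_toy <- expand.grid(x = xs_toy, y = ys_toy)

# Stand-in for predict(): one "prediction" per grid row
z_toy <- grid_toy$x + grid_toy$y

# Rows index ys_toy and columns index xs_toy when filled by row
m_toy <- matrix(z_toy, nrow = length(ys_toy), ncol = length(xs_toy), byrow = TRUE)

print(m_toy)
#      [,1] [,2] [,3]
# [1,]   11   12   13
# [2,]   21   22   23
```

Here m_toy[2, 3] is 23, the value at x = 3 and y = 20. In the lab's code both sequences have length 100, so the matrix is square and a transposed matrix would not produce an error, only a misplaced surface.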
Shown below is the regression surface of a generalized additive model, or GAM, a type of model that allows for non-linear relationships between the predictors and the outcome using spline functions:
library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
Once the new model has been fit, only the matrix of predicted values needs to be updated (since the “x” and “y” variables from the previous example remain unchanged).
z <- predict(model, newdata = grid) # Predictions for every combination in the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Store predictions as a matrix
## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>%
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds")
Later in the semester we will discuss methods for determining which of these two models should be preferred.
Question #3: Using this section’s code as a
template, add the linear regression surface for the model
Debt_median ~ Net_Tuition + ACT_median to a 3-D scatter
plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the
y-variable. You should use the lm() function to fit this
model prior to Step 2.
Axis labels in plotly can be modified using the
layout() function, while most other scales can be labeled
in the function used to create them:
## Plot of the GAM model - gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>%
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds", colorbar = list(title = "Salary")) %>%
  layout(scene = list(xaxis = list(title = "Cost"),
                      yaxis = list(title = "Admission Rate"),
                      zaxis = list(title = "Median 10 year salary")))
Documentation for the full set of options in layout()
can be found
here.
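For 2-D graphics the axes are not nested inside scene; the xaxis and yaxis lists are given directly to layout(). The sketch below is an illustration using R's built-in mtcars data (not the colleges data), with axis titles chosen for this example.

```r
library(plotly)

# 2-D scatter plot with custom axis labels and a plot title
p_labeled <- plot_ly(data = mtcars, type = "scatter", mode = "markers",
                     x = ~wt, y = ~mpg) %>%
  layout(title = "Fuel economy by weight",
         xaxis = list(title = "Weight (1000 lbs)"),
         yaxis = list(title = "Miles per gallon"))

p_labeled
```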
Most plotly graphics can be made into animations by
adding a frame argument, which defines a series of data
snapshots that the animation will progress through.
As an example, the code below creates an animated bar chart showing the populations of US states for each year going from 2010 to 2018:
## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")
## Tidy the data
library(tidyr)
library(stringr)
states_long <- pivot_longer(states, cols = 2:ncol(states), names_to = "Year", values_to = "Population")
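states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")
## fixed(".") matches a literal period; a bare "." is a regular expression that matches any character
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "")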
## Animation
plot_ly(data = states_long, type = "bar",
x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE)
Notice that these data needed to be converted to “long format” for
the column “Year” to be used as the frame argument.
Additionally, reorder() is used to arrange the states by
their initial population (assumed to be their minimum).
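The reshaping step can be easier to follow on a tiny example. The sketch below uses a made-up wide table (wide, its state names, and its population values are hypothetical, and the leading periods are an assumption about how such names might be stored). It converts the table to long format with pivot_longer() and then cleans the labels using base R's sub() with fixed = TRUE, so that "X" and "." are treated as literal characters rather than regular expressions.

```r
library(tidyr)

# Hypothetical wide data: one row per state, one column per year
wide <- data.frame(State = c(".Iowa", ".Ohio"),
                   X2010 = c(3.05, 11.54),
                   X2011 = c(3.07, 11.55),
                   stringsAsFactors = FALSE)

# Long format: one row per state-year combination
long <- pivot_longer(wide, cols = 2:ncol(wide),
                     names_to = "Year", values_to = "Population")

# Remove the literal "X" and "." prefixes
long$Year  <- sub("X", "", long$Year, fixed = TRUE)
long$State <- sub(".", "", long$State, fixed = TRUE)

print(long)
```

The result has four rows (two states times two years), and the single column Year can now be mapped to the frame argument.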
Animations can be customized using the animation_opts()
function.
## Fast and bouncy animation
plot_ly(data = states_long, type = "bar",
x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
animation_opts(frame = 100, easing = "elastic", redraw = FALSE)
Within animation_opts(), the frame argument
controls the speed at which frames progress. The default is 500
milliseconds, so this animation is 5 times faster than the initial
example.
The easing argument implements a transition between
frames (in this case an elastic bounce). Different easing options are listed
here between lines 68 and 103.
Finally, redraw = FALSE is used to avoid redrawing the
entire plot at each frame. In this example it doesn’t make much of a
difference, but for larger data sets it can greatly reduce lag.
Question #4: The code below reads a data set
compiled by Mother
Jones that aims to document all mass shootings in the United States.
For this question, create an animated plot that displays the yearly
number of fatalities and injuries in these shootings over time. For
reference, a sample animation is included below (yours should be
similar, but it doesn’t need to be identical). Hint: Before
creating the animation you should use group_by(),
summarize(), and pivot_longer() to prepare the
data.
shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')